As information explodes across the Internet and intranets, information retrieval
(IR) systems must cope with the challenge of scale. Providing scalable
performance for rapidly increasing data and workloads is critical in the design
of next-generation information retrieval systems. This dissertation studies
scalable distributed IR architectures that not only provide quick response but
also maintain acceptable retrieval accuracy. Our distributed architectures
exploit parallelism in information retrieval on a cluster of parallel IR servers
built from symmetric multiprocessors, and use partial collection replication
with replica selection, as well as collection selection, to restrict the search
to a small percentage of the data while maintaining retrieval accuracy.

We first investigate using partial collection replication for IR systems. We
examine query locality in real systems, how to select a partial replica based on
relevance, how to balance load between replicas and the original collection, and
update overheads and strategies. Our results show that there is sufficient query
locality to justify partial replication for information retrieval. Our proposed
replica selection algorithm effectively selects relevant partial replicas and is
inexpensive to implement. Our evidence also indicates that partial replication
achieves better performance than caching queries, because the replica selection
algorithm finds similarity between non-identical queries and thus increases
observed locality.

We use a validated simulator to perform a detailed performance evaluation of
distributed IR architectures. We explore how best to build parallel IR servers
using symmetric multiprocessors, evaluate the performance of partial collection
replication and collection selection, and compare the performance of partial
collection replication with that of collection partitioning and of collection
selection. Finally, we present experiments for searching a terabyte of text. We
also examine how performance changes when we use fewer large servers, faster
servers, and longer queries.

Our results show that because IR systems have heavy computational and I/O
loads, the number of CPUs, disks, and threads must be carefully balanced to
achieve scalable performance. Partial collection replication is much more
effective at decreasing query response time than collection partitioning for a
loaded system, even with fewer resources, and it requires only modest query
locality. Partial collection replication also performs better than collection
selection when there is enough query locality, and it performs worse when
collection access is fairly uniform after collection selection. Finally, our
results show that replica selection and collection selection can be combined to
provide quick response time for a terabyte of text. Changes in system
configuration do not significantly change the relative improvements due to
partial collection replication and collection selection, although they affect
the absolute response time.